Dear Candidate,
Thank you for spending your valuable time with us during the Logitech interview. We are happy to announce that you have been shortlisted for the next phase of our hiring process.
Attached below is some simple mock-up data. The intention is a short analytical exercise that lets us assess your coding and analytical skills.
You are not expected to spend more than two hours on this assignment.
What we are looking for:
(Code and output hidden; see .rmd for code)
| Category1 | Category2 | Category3 | 10-Dec | 11-Jan | 11-Feb | 11-Mar | 11-Apr | 11-May | 11-Jun | 11-Jul | 11-Aug | 11-Sep | 11-Oct | 11-Nov | 11-Dec | 12-Jan | 12-Feb | 12-Mar | 12-Apr | 12-May | 12-Jun | 12-Jul | 12-Aug | 12-Sep | 12-Oct | 12-Nov | 12-Dec | 13-Jan | 13-Feb | 13-Mar | 13-Apr | 13-May | 13-Jun | 13-Jul | 13-Aug | 13-Sep | 13-Oct | 13-Nov | 13-Dec | 14-Jan | 14-Feb | 14-Mar | 14-Apr | 14-May | 14-Jun | 14-Jul | 14-Aug | 14-Sep | 14-Oct | 14-Nov | 14-Dec | 15-Jan | 15-Feb | 15-Mar | 15-Apr | 15-May | 15-Jun | 15-Jul | 15-Aug | 15-Sep | 15-Oct | 15-Nov | 15-Dec | 16-Jan | 16-Feb | 16-Mar | 16-Apr | 16-May | 16-Jun | 16-Jul | 16-Aug |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| A | X | W | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30 | 0 | 0 | 5 | 0 | 0 | 570 | 2061 | 13822 | 16730 | 13178 | 9814 | 10166 | 6495 | 27470 | 57135 | 24230 | 22576 | 25627 | 21283 | 21486 | 31879 | 25246 | 27515 |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| A | A | A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 445387 | 409590 | 446587 | 313901 | 294959 | 371677 | 311436 | 342033 | 386121 | 285165 | 301804 | 508148 | 278061 | 310467 | 358239 | 248998 | 232080 | 302672 | 267015 | 322004 | 374625 | 297737 | 452887 | 705445 | 258244 | 298029 | 339785 | 233427 | 224777 | 291458 | 241622 | 293511 | 356579 | 264255 | 337553 | 679924 | 289284 | 319954 | 351710 | 253872 | 261584 | 340386 | 275873 | 332474 |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| A | A | B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 734161 | 685919 | 789018 | 509878 | 488771 | 640080 | 466612 | 598391 | 639250 | 410131 | 556360 | 984537 | 427590 | 498764 | 575448 | 378067 | 381543 | 481508 | 390322 | 474441 | 542372 | 394104 | 442740 | 813362 | 395002 | 470135 | 464293 | 348825 | 327153 | 388646 | 311547 | 387380 | 448017 | 294690 | 366460 | 715109 | 305632 | 315760 | 369734 | 261899 | 239278 | 297683 | 250246 | 292801 |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| A | A | C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 35450 | 32177 | 40849 | 25757 | 27031 | 39493 | 30014 | 37560 | 39276 | 26281 | 30080 | 67193 | 23758 | 39006 | 41006 | 24468 | 23331 | 37436 | 28907 | 34107 | 34987 | 19505 | 25061 | 65220 | 25691 | 27354 | 28577 | 21515 | 22181 | 23848 | 30147 | 24904 | 25861 | 18881 | 25321 | 48449 | 20415 | 23348 | 27214 | 18392 | 23174 | 20951 | 17712 | 18621 |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Rows: 282
Columns: 72
$ Category1 <chr> NA, NA, "A", NA, NA, "A", NA, NA, "A", NA, NA, "A", NA, NA, …
$ Category2 <chr> NA, NA, "X", NA, NA, "A", NA, NA, "A", NA, NA, "A", NA, NA, …
$ Category3 <chr> NA, NA, "W", NA, NA, "A", NA, NA, "B", NA, NA, "C", NA, NA, …
$ `10-Dec` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Jan` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Feb` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Mar` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Apr` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-May` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Jun` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Jul` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Aug` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Sep` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Oct` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Nov` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Dec` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Jan` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Feb` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Mar` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Apr` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-May` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Jun` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Jul` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Aug` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Sep` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Oct` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Nov` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Dec` <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `13-Jan` <dbl> NA, NA, 0, NA, NA, 445387, NA, NA, 734161, NA, NA, 35450, NA…
$ `13-Feb` <dbl> NA, NA, 0, NA, NA, 409590, NA, NA, 685919, NA, NA, 32177, NA…
$ `13-Mar` <dbl> NA, NA, 0, NA, NA, 446587, NA, NA, 789018, NA, NA, 40849, NA…
$ `13-Apr` <dbl> NA, NA, 0, NA, NA, 313901, NA, NA, 509878, NA, NA, 25757, NA…
$ `13-May` <dbl> NA, NA, 0, NA, NA, 294959, NA, NA, 488771, NA, NA, 27031, NA…
$ `13-Jun` <dbl> NA, NA, 0, NA, NA, 371677, NA, NA, 640080, NA, NA, 39493, NA…
$ `13-Jul` <dbl> NA, NA, 0, NA, NA, 311436, NA, NA, 466612, NA, NA, 30014, NA…
$ `13-Aug` <dbl> NA, NA, 0, NA, NA, 342033, NA, NA, 598391, NA, NA, 37560, NA…
$ `13-Sep` <dbl> NA, NA, 0, NA, NA, 386121, NA, NA, 639250, NA, NA, 39276, NA…
$ `13-Oct` <dbl> NA, NA, 0, NA, NA, 285165, NA, NA, 410131, NA, NA, 26281, NA…
$ `13-Nov` <dbl> NA, NA, 0, NA, NA, 301804, NA, NA, 556360, NA, NA, 30080, NA…
$ `13-Dec` <dbl> NA, NA, 0, NA, NA, 508148, NA, NA, 984537, NA, NA, 67193, NA…
$ `14-Jan` <dbl> NA, NA, 0, NA, NA, 278061, NA, NA, 427590, NA, NA, 23758, NA…
$ `14-Feb` <dbl> NA, NA, 0, NA, NA, 310467, NA, NA, 498764, NA, NA, 39006, NA…
$ `14-Mar` <dbl> NA, NA, 0, NA, NA, 358239, NA, NA, 575448, NA, NA, 41006, NA…
$ `14-Apr` <dbl> NA, NA, 0, NA, NA, 248998, NA, NA, 378067, NA, NA, 24468, NA…
$ `14-May` <dbl> NA, NA, 0, NA, NA, 232080, NA, NA, 381543, NA, NA, 23331, NA…
$ `14-Jun` <dbl> NA, NA, 0, NA, NA, 302672, NA, NA, 481508, NA, NA, 37436, NA…
$ `14-Jul` <dbl> NA, NA, 0, NA, NA, 267015, NA, NA, 390322, NA, NA, 28907, NA…
$ `14-Aug` <dbl> NA, NA, 0, NA, NA, 322004, NA, NA, 474441, NA, NA, 34107, NA…
$ `14-Sep` <dbl> NA, NA, 30, NA, NA, 374625, NA, NA, 542372, NA, NA, 34987, N…
$ `14-Oct` <dbl> NA, NA, 0, NA, NA, 297737, NA, NA, 394104, NA, NA, 19505, NA…
$ `14-Nov` <dbl> NA, NA, 0, NA, NA, 452887, NA, NA, 442740, NA, NA, 25061, NA…
$ `14-Dec` <dbl> NA, NA, 5, NA, NA, 705445, NA, NA, 813362, NA, NA, 65220, NA…
$ `15-Jan` <dbl> NA, NA, 0, NA, NA, 258244, NA, NA, 395002, NA, NA, 25691, NA…
$ `15-Feb` <dbl> NA, NA, 0, NA, NA, 298029, NA, NA, 470135, NA, NA, 27354, NA…
$ `15-Mar` <dbl> NA, NA, 570, NA, NA, 339785, NA, NA, 464293, NA, NA, 28577, …
$ `15-Apr` <dbl> NA, NA, 2061, NA, NA, 233427, NA, NA, 348825, NA, NA, 21515,…
$ `15-May` <dbl> NA, NA, 13822, NA, NA, 224777, NA, NA, 327153, NA, NA, 22181…
$ `15-Jun` <dbl> NA, NA, 16730, NA, NA, 291458, NA, NA, 388646, NA, NA, 23848…
$ `15-Jul` <dbl> NA, NA, 13178, NA, NA, 241622, NA, NA, 311547, NA, NA, 30147…
$ `15-Aug` <dbl> NA, NA, 9814, NA, NA, 293511, NA, NA, 387380, NA, NA, 24904,…
$ `15-Sep` <dbl> NA, NA, 10166, NA, NA, 356579, NA, NA, 448017, NA, NA, 25861…
$ `15-Oct` <dbl> NA, NA, 6495, NA, NA, 264255, NA, NA, 294690, NA, NA, 18881,…
$ `15-Nov` <dbl> NA, NA, 27470, NA, NA, 337553, NA, NA, 366460, NA, NA, 25321…
$ `15-Dec` <dbl> NA, NA, 57135, NA, NA, 679924, NA, NA, 715109, NA, NA, 48449…
$ `16-Jan` <dbl> NA, NA, 24230, NA, NA, 289284, NA, NA, 305632, NA, NA, 20415…
$ `16-Feb` <dbl> NA, NA, 22576, NA, NA, 319954, NA, NA, 315760, NA, NA, 23348…
$ `16-Mar` <dbl> NA, NA, 25627, NA, NA, 351710, NA, NA, 369734, NA, NA, 27214…
$ `16-Apr` <dbl> NA, NA, 21283, NA, NA, 253872, NA, NA, 261899, NA, NA, 18392…
$ `16-May` <dbl> NA, NA, 21486, NA, NA, 261584, NA, NA, 239278, NA, NA, 23174…
$ `16-Jun` <dbl> NA, NA, 31879, NA, NA, 340386, NA, NA, 297683, NA, NA, 20951…
$ `16-Jul` <dbl> NA, NA, 25246, NA, NA, 275873, NA, NA, 250246, NA, NA, 17712…
$ `16-Aug` <dbl> NA, NA, 27515, NA, NA, 332474, NA, NA, 292801, NA, NA, 18621…
Category1 Category2 Category3 10-Dec 11-Jan 11-Feb 11-Mar 11-Apr
188 188 188 188 188 188 188 188
11-May 11-Jun 11-Jul 11-Aug 11-Sep 11-Oct 11-Nov 11-Dec
188 188 188 188 188 188 188 188
12-Jan 12-Feb 12-Mar 12-Apr 12-May 12-Jun 12-Jul 12-Aug
188 188 188 188 188 188 188 188
12-Sep 12-Oct 12-Nov 12-Dec 13-Jan 13-Feb 13-Mar 13-Apr
188 188 188 188 188 188 188 188
13-May 13-Jun 13-Jul 13-Aug 13-Sep 13-Oct 13-Nov 13-Dec
188 188 188 188 188 188 188 188
14-Jan 14-Feb 14-Mar 14-Apr 14-May 14-Jun 14-Jul 14-Aug
188 188 188 188 188 188 188 188
14-Sep 14-Oct 14-Nov 14-Dec 15-Jan 15-Feb 15-Mar 15-Apr
188 188 188 188 188 188 188 188
15-May 15-Jun 15-Jul 15-Aug 15-Sep 15-Oct 15-Nov 15-Dec
188 188 188 188 188 188 188 188
16-Jan 16-Feb 16-Mar 16-Apr 16-May 16-Jun 16-Jul 16-Aug
188 188 188 188 188 188 188 188
[1] 13536
| Category1 | Category2 | Category3 | n |
|---|---|---|---|
| NA | NA | NA | 188 |
| A | J | NULL | 5 |
| A | J | O | 5 |
| A | C | NULL | 4 |
| B | C | W | 3 |
| C | C | W | 3 |
| A | C | W | 2 |
| A | A | A | 1 |
| A | A | B | 1 |
| A | A | C | 1 |
| A | A | E | 1 |
| A | A | I | 1 |
| A | A | M | 1 |
| A | A | V | 1 |
| A | A | X | 1 |
| A | B | P | 1 |
| A | B | Q | 1 |
| A | B | R | 1 |
| A | D | NULL | 1 |
| A | E | U | 1 |
| A | F | L | 1 |
| A | F | T | 1 |
| A | G | D | 1 |
| A | G | F | 1 |
| A | G | H | 1 |
| A | G | J | 1 |
| A | H | A3 | 1 |
| A | I | A2 | 1 |
| A | I | G | 1 |
| A | I | K | 1 |
| A | I | NULL | 1 |
| A | J | A1 | 1 |
| A | J | Z | 1 |
| A | X | W | 1 |
| B | A | A | 1 |
| B | A | B | 1 |
| B | A | C | 1 |
| B | A | E | 1 |
| B | A | I | 1 |
| B | A | N | 1 |
| B | A | V | 1 |
| B | A | Y | 1 |
| B | B | P | 1 |
| B | B | Q | 1 |
| B | B | R | 1 |
| B | E | U | 1 |
| B | F | L | 1 |
| B | F | S | 1 |
| B | F | T | 1 |
| B | G | D | 1 |
| B | G | F | 1 |
| B | G | H | 1 |
| B | G | J | 1 |
| B | I | A2 | 1 |
| B | I | G | 1 |
| B | I | K | 1 |
| B | J | NULL | 1 |
| C | A | A | 1 |
| C | A | B | 1 |
| C | A | C | 1 |
| C | A | E | 1 |
| C | A | I | 1 |
| C | A | V | 1 |
| C | A | Y | 1 |
| C | B | P | 1 |
| C | B | Q | 1 |
| C | B | R | 1 |
| C | E | U | 1 |
| C | F | L | 1 |
| C | F | T | 1 |
| C | G | D | 1 |
| C | G | F | 1 |
| C | G | H | 1 |
| C | G | J | 1 |
| C | I | A2 | 1 |
| C | I | G | 1 |
| C | I | K | 1 |
| C | J | NULL | 1 |
| C | J | Z | 1 |
(Code and output hidden; see .rmd for code)
Rows: 6,486
Columns: 5
$ Category1 <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", …
$ Category2 <chr> "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", …
$ Category3 <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", …
$ Date <date> 2010-12-01, 2011-01-01, 2011-02-01, 2011-03-01, 2011-04-01,…
$ Value <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
Number of NA Dates: 0
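The actual wrangling code is hidden above (see the .rmd), but one plausible sketch of the reshaping step that would produce this long format is shown below. The names `raw_data` and the date-parsing format are assumptions for illustration, not the author's actual code:

```r
# Sketch: reshape the wide monthly columns into the long format shown above
# (assumes 'raw_data' is the imported sheet with Category1/2/3 plus month columns)
library(dplyr)
library(tidyr)

data_wrangled <- raw_data %>%
  filter(!is.na(Category1)) %>%                     # drop the all-NA padding rows
  pivot_longer(cols = -c(Category1, Category2, Category3),
               names_to = "Month", values_to = "Value") %>%
  mutate(Date = as.Date(paste0(Month, "-01"), format = "%y-%b-%d")) %>%
  select(Category1, Category2, Category3, Date, Value)
```

Note the row count is consistent with the glimpse above: 94 non-NA category rows times 69 monthly columns gives 6,486 long-format rows.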
(Code and output hidden; see .rmd for code)
Following the Exploratory Data Analysis (EDA), I plan to create time series objects from the following Category Code combinations: A/A/M, B/C/W, and C/C/W.
My rationale:
These combinations carry a substantial share of the total volume in the dataset, so they offer the best chance of producing meaningful forecasts of future values.
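One quick way to check this rationale is to rank the category combinations by total volume. This is a sketch, assuming the long-format data frame `data_wrangled` (columns `Category1`, `Category2`, `Category3`, `Date`, `Value`) described above:

```r
# Sketch: rank category combinations by total volume to justify the selection
library(dplyr)

data_wrangled %>%
  group_by(Category1, Category2, Category3) %>%
  summarise(Total = sum(Value, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(Total)) %>%
  head(10)
```

Combinations near the top of this ranking dominate the dataset and are therefore the most informative series to model.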
# Assign data to ts_data (assuming 'data_wrangled' is already prepared)
ts_data <- data_wrangled
# Data Validation: Ensure essential structure for time series analysis
if (!is.data.frame(ts_data) || !any(tolower(names(ts_data)) == "date")) {
stop("Error: 'ts_data' is not a dataframe or 'date' column does not exist.") # Stop execution if the data is invalid
}
# Confirmation Message (if the code reaches here, validation checks have passed)
print("Check passed: 'ts_data' is a dataframe and contains a 'date' column.")
[1] "Check passed: 'ts_data' is a dataframe and contains a 'date' column."
# Assumptions (Make sure these align with your data)
# - 'ts_data' is loaded with columns: Category1, Category2, Category3, Date, Value
# - Your goal is to create separate time series for different category combinations
# Parameters
start_date <- as.Date("2013-01-01")
end_date <- as.Date("2016-08-31")
frequency <- 12 # Monthly data
# Predefined Category Combinations
categories <- list(
c("A", "A", "M"),
c("B", "C", "W"),
c("C", "C", "W")
)
# Storage for Time Series and Output
ts_list <- list() # Stores time series objects
output_text <- list() # Stores text output for reporting
# Category Conversion and Verification Loop
for (i in seq_along(categories)) {
category <- categories[[i]]
# Filtering and Aggregation
filtered_data <- ts_data %>%
filter(Category1 == category[1], Category2 == category[2], Category3 == category[3],
Date >= start_date, Date <= end_date) %>%
group_by(Date) %>%
summarise(Value = sum(Value)) %>% # Ensure 'sum' is your intended aggregation
arrange(Date)
# Time Series Creation
ts_object <- ts(filtered_data$Value,
start = c(year(min(filtered_data$Date)), month(min(filtered_data$Date))),
frequency = frequency)
# Store Time Series with Descriptive Name
ts_list_name <- paste(category, collapse = "_")
ts_list[[ts_list_name]] <- ts_object
# Data Consistency Checks (with formatted output)
output_text[[ts_list_name]] <- capture.output({
formatted_start <- format(min(filtered_data$Date), "%Y-%m-%d")
formatted_end <- format(max(filtered_data$Date), "%Y-%m-%d")
cat("\n---------------------------------------------\n")
cat("Category combination: ", ts_list_name, "\n\n")
cat("Summary for this category:\n")
cat("- Date range in data: [", formatted_start, " - ", formatted_end, "]\n")
cat("- Time series length: ", length(ts_object), "\n")
cat("- Expected periods (unique dates): ", length(unique(filtered_data$Date)), "\n")
cat("- Data points used: ", nrow(filtered_data), "\n")
cat("---------------------------------------------\n")
})
}
# Display Verification Output
for (name in names(output_text)) {
cat("\nOutput for category combination:", name, "\n")
cat(output_text[[name]], sep="\n")
}
Output for category combination: A_A_M
---------------------------------------------
Category combination: A_A_M
Summary for this category:
- Date range in data: [ 2013-01-01 - 2016-08-01 ]
- Time series length: 44
- Expected periods (unique dates): 44
- Data points used: 44
---------------------------------------------
Output for category combination: B_C_W
---------------------------------------------
Category combination: B_C_W
Summary for this category:
- Date range in data: [ 2013-01-01 - 2016-08-01 ]
- Time series length: 44
- Expected periods (unique dates): 44
- Data points used: 44
---------------------------------------------
Output for category combination: C_C_W
---------------------------------------------
Category combination: C_C_W
Summary for this category:
- Date range in data: [ 2013-01-01 - 2016-08-01 ]
- Time series length: 44
- Expected periods (unique dates): 44
- Data points used: 44
---------------------------------------------
# Assumptions
# - 'start_date' has been defined previously
# - 'ts_list' contains a list of your time series objects
# Store Plots for Later Use
time_series_plots_list <- list()
# Create and Display Plots
for (name in names(ts_list)) {
ts_object <- ts_list[[name]]
# Generate Date Sequence for Consistent Plotting
date_seq <- seq(start_date, by = "month", length.out = length(ts_object))
# Create ggplot2 Time Series Plot
plot <- ggplot(data.frame(Time = date_seq, Value = as.numeric(ts_object)), aes(x = Time, y = Value)) +
geom_line() +
labs(title = paste("Time Series for:", name),
x = "Time",
y = "Value") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(size = 20, hjust = 0.5, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.x = element_text(size = 14, face = "bold"),
axis.title.y = element_text(size = 14, face = "bold"))
# Store and Print Plot
time_series_plots_list[[name]] <- plot
print(plot)
}

# Store STL Plots for Later Use
stl_plots_list <- list()
# STL Decomposition and Plotting
for (name in names(ts_list)) {
ts_object <- ts_list[[name]]
# Decompose Time Series (STL)
ts_stl <- stl(ts_object, s.window = "periodic", robust = TRUE)
# Create STL Plot with Enhanced Formatting
stl_plot <- autoplot(ts_stl) +
labs(title = paste("STL Decomposition for:", name),
x = "Time",
y = "Value") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(size = 20, hjust = 0.5, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.x = element_text(size = 14, face = "bold"),
axis.title.y = element_text(size = 14, face = "bold"),
strip.text.x = element_text(size = 16, face = "bold"),
strip.background = element_rect(fill = "lightblue", colour = "deepskyblue", size = 1))
# Store and Print
stl_plots_list[[name]] <- stl_plot
print(stl_plot)
}
(Figures: STL decomposition plots for each category combination; four panels beginning with the observed data, followed by the decomposed components.)
# Store Seasonality Assessment Results
seasonality_output <- list()
# Analyze Seasonality of Time Series
for (name in names(ts_list)) {
ts_object <- ts_list[[name]]
# Decompose Time Series (STL)
stl_object <- stl(ts_object, s.window = "periodic")
# Measure Seasonality Strength (MAD)
seasonal_comp <- stl_object$time.series[, "seasonal"]
seasonal_mad <- mean(abs(seasonal_comp - mean(seasonal_comp)))
  # Interpret Seasonality Strength
  # Note: these fixed thresholds implicitly assume a normalized seasonal
  # component; for raw-valued series such as these, the MAD should be judged
  # relative to the overall scale of the data rather than against 0.2/0.1.
  seasonality_assessment <- if (seasonal_mad > 0.2) {
"The time series exhibits significant seasonality.\n"
} else if (seasonal_mad > 0.1) {
"The time series exhibits some seasonality.\n"
} else {
"The time series likely does not exhibit significant seasonality.\n"
}
# Store Seasonality Analysis Report
seasonality_output[[name]] <- capture.output({
cat("\n---------------------------------------------\n")
cat("Time Series Analysis: ", name, "\n\n")
cat("Seasonality Assessment Summary:\n")
cat(sprintf("Mean Absolute Deviation (MAD) of the seasonal component: %.2f\n", seasonal_mad))
cat(seasonality_assessment)
cat("---------------------------------------------\n")
})
}
# Display Assessment Reports
for (name in names(seasonality_output)) {
cat("\nOutput for Time Series:", name, "\n")
cat(seasonality_output[[name]], sep="\n")
}
Output for Time Series: A_A_M
---------------------------------------------
Time Series Analysis: A_A_M
Seasonality Assessment Summary:
Mean Absolute Deviation (MAD) of the seasonal component: 3878397.13
The time series exhibits significant seasonality.
---------------------------------------------
Output for Time Series: B_C_W
---------------------------------------------
Time Series Analysis: B_C_W
Seasonality Assessment Summary:
Mean Absolute Deviation (MAD) of the seasonal component: 2342388.23
The time series exhibits significant seasonality.
---------------------------------------------
Output for Time Series: C_C_W
---------------------------------------------
Time Series Analysis: C_C_W
Seasonality Assessment Summary:
Mean Absolute Deviation (MAD) of the seasonal component: 2593021.80
The time series exhibits significant seasonality.
---------------------------------------------
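Because the MAD is scale-dependent (the values above are in the millions simply because the series are large), a scale-free alternative is the seasonal strength measure F_s = max(0, 1 - Var(remainder) / Var(seasonal + remainder)), which lies between 0 and 1. A sketch, reusing the `ts_list` objects from above:

```r
# Sketch: scale-independent seasonal strength (values near 1 = strong seasonality)
for (name in names(ts_list)) {
  stl_fit   <- stl(ts_list[[name]], s.window = "periodic")
  seasonal  <- stl_fit$time.series[, "seasonal"]
  remainder <- stl_fit$time.series[, "remainder"]
  strength  <- max(0, 1 - var(remainder) / var(seasonal + remainder))
  cat(sprintf("%s: seasonal strength = %.3f\n", name, strength))
}
```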
# Store Results for Later
differencing_output <- list()
differenced_ts_list <- list()
# Differencing Analysis Loop
for (name in names(ts_list)) {
# Start with a Copy of Original Data
current_data <- ts_list[[name]]
max_iterations <- 5 # Set a maximum for differencing attempts
iterations <- 0
seasonal_period <- frequency
# Store Differencing Test Report
differencing_output[[name]] <- capture.output({
cat("\n---------------------------------------------\n")
cat("Performing Differencing Tests for: ", name, "\n\n")
# Seasonal Differencing (if applicable)
if (seasonal_period > 1) {
current_data <- diff(current_data, lag = seasonal_period)
iterations <- iterations + 1
cat("Seasonal differencing applied with lag =", seasonal_period, "\n")
# Update ts object after seasonal differencing
current_data <- ts(current_data, start = start(ts_list[[name]]), frequency = frequency)
}
# Iterative Regular Differencing
while (iterations < max_iterations) {
adf_result <- adf.test(current_data, alternative = "stationary")
# Stop if stationary
if (adf_result$p.value < 0.05) {
break
}
# Otherwise, difference and update
current_data <- diff(current_data)
iterations <- iterations + 1
cat(sprintf("After differencing %d times, p-value is %.5f \n", iterations, adf_result$p.value))
# Update ts object after differencing
current_data <- ts(current_data, start = start(ts_list[[name]]), frequency = frequency)
}
# Final Stationarity Assessment
if (adf_result$p.value < 0.05) {
cat(sprintf("Time Series %s appears stationary after %d differencing operations.\n", name, iterations))
} else {
cat(sprintf("Time Series %s is still non-stationary after maximum allowed differencing operations.\n", name))
}
cat("\n---------------------------------------------\n")
})
# Store Final Differenced Data
differenced_ts_list[[name]] <- current_data
}
# Display Test Reports
for (name in names(differencing_output)) {
cat("\nDifferencing Test Output for Time Series:", name, "\n")
cat(differencing_output[[name]], sep="\n")
}
Differencing Test Output for Time Series: A_A_M
---------------------------------------------
Performing Differencing Tests for: A_A_M
Seasonal differencing applied with lag = 12
After differencing 2 times, p-value is 0.55791
After differencing 3 times, p-value is 0.16701
Time Series A_A_M appears stationary after 3 differencing operations.
---------------------------------------------
Differencing Test Output for Time Series: B_C_W
---------------------------------------------
Performing Differencing Tests for: B_C_W
Seasonal differencing applied with lag = 12
After differencing 2 times, p-value is 0.47475
After differencing 3 times, p-value is 0.09616
Time Series B_C_W appears stationary after 3 differencing operations.
---------------------------------------------
Differencing Test Output for Time Series: C_C_W
---------------------------------------------
Performing Differencing Tests for: C_C_W
Seasonal differencing applied with lag = 12
After differencing 2 times, p-value is 0.21465
Time Series C_C_W appears stationary after 2 differencing operations.
---------------------------------------------
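As a cross-check on the manual differencing loop above, the `forecast` package provides unit-root-based helpers that estimate the number of seasonal and regular differences directly. A sketch, again assuming `ts_list` holds the original series:

```r
# Sketch: cross-check differencing orders with forecast's unit-root helpers
library(forecast)

for (name in names(ts_list)) {
  cat(sprintf("%s: nsdiffs = %d, ndiffs = %d\n",
              name,
              nsdiffs(ts_list[[name]]),   # suggested seasonal differences (D)
              ndiffs(ts_list[[name]])))   # suggested regular differences (d)
}
```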
# Store Plots
differencing_plots_list <- list()
# Differencing and Plotting Loop
for (name in names(ts_list)) {
current_data <- ts_list[[name]]
max_iterations <- 5
iterations <- 0
seasonal_period <- frequency # Assuming 'frequency' is defined earlier
# Seasonal Differencing (if needed)
if (seasonal_period > 1) {
current_data <- diff(current_data, lag = seasonal_period)
iterations <- 1
}
# Iterative Regular Differencing
while (iterations < max_iterations) {
adf_result <- adf.test(current_data, alternative = "stationary")
if (adf_result$p.value < 0.05) {
break
}
current_data <- diff(current_data)
iterations <- iterations + 1
}
# Prepare for Plotting (adjusting for differencing)
date_seq <- seq(start_date, by = "month", length.out = length(current_data))
# Create and Store Differenced Time Series Plot
plot <- ggplot(data.frame(Date = date_seq, Value = as.numeric(current_data)), aes(x = Date, y = Value)) +
geom_line() +
labs(title = paste("Time Series after", iterations, "Differencing(s) for:", name),
x = "Date",
y = "Value") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
differencing_plots_list[[name]] <- plot
print(plot)
}
(Figures: the final state of each time series after applying the necessary differencing operations.)
# Assuming ts_list contains your time series objects
for (name in names(ts_list)) {
cat("\n---------------------------------------------\n")
cat(sprintf("Augmented Dickey-Fuller Test for: %s\n", name))
current_data <- ts_list[[name]]
# Perform the ADF Test
adf_result <- adf.test(current_data, alternative = "stationary")
# Explain Results Clearly
cat(sprintf("ADF Test Results for %s:\n", name))
cat(sprintf("Test Statistic: %.4f, P-value: %.4f\n", adf_result$statistic, adf_result$p.value))
  # Note: tseries::adf.test() does not return critical values, so only the
  # test statistic and p-value are reported here.
# Interpret the Results (with guidance)
if (adf_result$p.value < 0.05) {
cat("Conclusion: The time series appears to be stationary.\n")
} else {
cat("Conclusion: The time series may still be non-stationary. Consider differencing to achieve stationarity.\n")
}
cat("\n---------------------------------------------\n")
}
---------------------------------------------
Augmented Dickey-Fuller Test for: A_A_M
ADF Test Results for A_A_M:
Test Statistic: -3.7219, P-value: 0.0346
Conclusion: The time series appears to be stationary.
---------------------------------------------
---------------------------------------------
Augmented Dickey-Fuller Test for: B_C_W
ADF Test Results for B_C_W:
Test Statistic: -2.4227, P-value: 0.4064
Conclusion: The time series may still be non-stationary. Consider differencing to achieve stationarity.
---------------------------------------------
---------------------------------------------
Augmented Dickey-Fuller Test for: C_C_W
ADF Test Results for C_C_W:
Test Statistic: -3.7192, P-value: 0.0348
Conclusion: The time series appears to be stationary.
---------------------------------------------
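The ADF test takes a unit root as its null hypothesis, so it can be usefully complemented by the KPSS test, whose null hypothesis is stationarity; agreement between the two gives more confidence in the conclusion. A sketch, assuming `tseries` is loaded as elsewhere in this report:

```r
# Sketch: complement ADF (null: unit root) with KPSS (null: stationarity)
for (name in names(ts_list)) {
  kpss_result <- kpss.test(ts_list[[name]])
  cat(sprintf("%s: KPSS statistic = %.4f, p-value = %.4f\n",
              name, kpss_result$statistic, kpss_result$p.value))
}
```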
# Loop through time series in your list
for (name in names(ts_list)) {
cat(sprintf("Box-Ljung Test for: %s\n", name))
# Extract Time Series Data
current_data <- ts_list[[name]]
# Estimate Time Series Frequency (with check)
numeric_frequency_estimate <- frequency(current_data)
if (is.na(numeric_frequency_estimate)) {
stop("Numeric frequency estimation failed. Please check the calculations.")
}
# Determine Lag for Test (adjustable rule)
lag_for_test <- max(1, min(20, numeric_frequency_estimate)) # Adjust rule if needed
# Perform Box-Ljung Test (check residuals for autocorrelation)
box_test_result <- Box.test(current_data, lag = lag_for_test, type = "Ljung-Box")
# Display Test Results
print(box_test_result)
cat("\n---------------------------------------------\n")
}

Box-Ljung Test for: A_A_M
Box-Ljung test
data: current_data
X-squared = 42.442, df = 12, p-value = 2.806e-05
---------------------------------------------
Box-Ljung Test for: B_C_W
Box-Ljung test
data: current_data
X-squared = 82.307, df = 12, p-value = 1.495e-12
---------------------------------------------
Box-Ljung Test for: C_C_W
Box-Ljung test
data: current_data
X-squared = 59.267, df = 12, p-value = 3.069e-08
---------------------------------------------
# Function to generate ACF and PACF plots for differenced data
generate_acf_pacf_plots <- function(ts_data, name) {
  # Compute ACF and PACF without plotting
  acf_plot <- forecast::Acf(ts_data, plot = FALSE)
  pacf_plot <- forecast::Pacf(ts_data, plot = FALSE)
  # Convert to ggplot objects using autoplot
  acf_ggplot <- ggplot2::autoplot(acf_plot) +
    ggtitle(paste("ACF for:", name)) +
    theme_minimal()
  pacf_ggplot <- ggplot2::autoplot(pacf_plot) +
    ggtitle(paste("PACF for:", name)) +
    theme_minimal()
  return(list(acf_plot = acf_ggplot, pacf_plot = pacf_ggplot))
}
# Assuming 'differenced_ts_list' contains the final differenced data,
# iterate over each time series to generate and arrange plots
for (name in names(differenced_ts_list)) {
  differenced_data <- differenced_ts_list[[name]]
  # Generate ACF and PACF plots
  acf_pacf_plots <- generate_acf_pacf_plots(differenced_data, name)
  # Arrange plots side by side for visual comparison
  gridExtra::grid.arrange(
    acf_pacf_plots$acf_plot,
    acf_pacf_plots$pacf_plot,
    ncol = 2,
    top = paste("Visual Diagnostics for Model Selection:", name)
  )
}
[Figures, one per series: "Visual Diagnostics for Model Selection" comparing the original time series, its differenced version (for stationarity), and the ACF and PACF plots of the differenced series.]
The ACF plot for the A,A,M time series shows statistically significant correlations extending out to around 12 lags, with spikes at lags 1-3, 12, 13, and 24. The periodic spikes indicate potential seasonality with a 12-month period; the earlier spikes suggest either a moving average component or a short-memory autoregressive component.
The PACF plot cuts off decisively after lag 1 and then mostly stays between the significance bands, oscillating around zero, which is characteristic of an AR(1) process. There is also a smaller spike at lag 12, providing further evidence of the 12-month seasonal cycle. The sharp cutoff supports that the ACF spikes likely come from a moving average process rather than a higher-order AR process.
The Augmented Dickey-Fuller test showed this time series was non-stationary. Stationarity was achieved after 1 seasonal difference and 3 consecutive regular differences, indicating both a trend and an annual seasonal component that must be removed before modeling.
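The number of differences reported here can be cross-checked with the unit-root-based helpers in the forecast package. A sketch, assuming `current_data` is one of the monthly series from the loop above:

```r
# Sketch: estimate required differencing orders
# (assumption: 'current_data' is a monthly ts object from ts_list)
library(forecast)

D <- nsdiffs(current_data)                    # suggested seasonal differences
d <- ndiffs(diff(current_data, lag = 12))     # regular differences after seasonal differencing
cat(sprintf("Seasonal differences (D): %d, regular differences (d): %d\n", D, d))
```

Note these helpers often suggest fewer regular differences than repeated ADF testing, since over-differencing inflates variance.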
The seasonality diagnostic gave a very high Mean Absolute Deviation (MAD) value of around 3.9 million for the seasonal component. This supports the ACF/PACF indications that there is a strong seasonal effect at 12 months.
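The original seasonality diagnostic is hidden in the .rmd; one plausible sketch of a MAD-based check decomposes the series with `stl` and measures the dispersion of the seasonal component (names here are illustrative assumptions):

```r
# Sketch of a Mean Absolute Deviation check on the seasonal component
# (assumption: 'current_data' is a monthly ts object with frequency 12)
decomp <- stl(current_data, s.window = "periodic")
seasonal_component <- decomp$time.series[, "seasonal"]

# Mean absolute deviation of the seasonal component around its mean
mad_seasonal <- mean(abs(seasonal_component - mean(seasonal_component)))
cat(sprintf("Seasonal MAD: %.0f\n", mad_seasonal))
```

A seasonal MAD that is large relative to the series scale (here, millions) points to a strong seasonal effect.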
Putting all of the above analyses together, I would start with a SARIMA(1,1,3)(1,1,0)[12] model to forecast this time series.
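The proposed specification can be fit with `forecast::Arima`. This is a sketch, assuming `ts_list` as defined earlier and `A_A_M` as the series name used in the output above:

```r
# Sketch: fit the proposed SARIMA(1,1,3)(1,1,0)[12] specification
# (assumption: 'ts_list' is the named list of monthly ts objects used above)
library(forecast)

fit <- Arima(ts_list[["A_A_M"]],
             order    = c(1, 1, 3),
             seasonal = list(order = c(1, 1, 0), period = 12))
summary(fit)

# Forecast the next 12 months from the fitted model
plot(forecast(fit, h = 12))
```

Comparing this fit's AICc against simpler candidates would show whether the extra MA terms are justified.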
The ACF plot for the B,C,W time series shows statistically significant correlations extending out to around lags 12-15, with additional spikes at lags 1-3. The periodic seasonal spikes indicate a potential 12-month seasonal cycle; the earlier spikes suggest either a moving average component or a short-memory autoregressive component.
The PACF plot cuts off decisively after lag 1 and then oscillates around zero between the significance bands, characteristic of an AR(1) process. There is also a smaller spike at lag 12, providing further evidence of the 12-month seasonal pattern. The sharp cutoff supports that the ACF spikes likely come from an MA process rather than a higher-order AR model.
The Augmented Dickey-Fuller test showed this time series was non-stationary initially. Stationarity was achieved after applying 1 seasonal difference and 3 regular differences, indicating both a trend and an annual seasonal component that must be removed before modeling.
The seasonality check gave a high Mean Absolute Deviation (MAD) of around 2.3 million for the seasonal component. This confirms the hypotheses of seasonality at the annual periodicity.
Based on the above, I would start by fitting a SARIMA(1,1,3)(1,1,0)[12] model to this time series.
The ACF plot for the C,C,W series shows statistically significant correlations extending out to lags around 12-15. There are additional smaller spikes at lags 1-5. The periodic seasonal spikes indicate a potential seasonal cycle at 12 months. The earlier spikes suggest there could be a higher order autoregressive process.
The PACF plot shows a slow decay in the correlations without clearly cutting off. This suggests the ACF spikes are from a higher order AR rather than an MA process. There is also a spike at lag 12 relating to the seasonal pattern.
The Augmented Dickey-Fuller test showed this series was non-stationary initially. Stationarity was achieved after applying 1 seasonal difference and 2 regular differences, again indicating both a trend and an annual seasonal component that must be removed before modeling.
The seasonality check gave a high Mean Absolute Deviation (MAD) of 2.6 million for the seasonal component, pointing to a strong seasonal effect.
Based on these analyses, I would start by fitting a SARIMA(2,1,2)(1,1,0)[12] model to this time series.
(Code and output hidden; see .rmd for code)
Series: ts_data
ARIMA(1,1,0)(0,1,0)[12]
Coefficients:
ar1
-0.7621
s.e. 0.1673
sigma^2 = 1.177e+13: log likelihood = -279.86
AIC=563.73 AICc=564.58 BIC=565.39
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -281927.6 2505690 1602520 223.335 323.0029 0.6869274 -0.3277715
Series: ts_data
ARIMA(1,1,0)(0,1,0)[12]
Coefficients:
ar1
-0.7782
s.e. 0.1702
sigma^2 = 4.501e+12: log likelihood = -271.72
AIC=547.44 AICc=548.3 BIC=549.11
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -24968.57 1549353 905330.8 67.48467 77.38088 0.6973285 -0.2433445
Series: ts_data
ARIMA(0,1,1)(0,1,0)[12]
Coefficients:
ma1
-0.8700
s.e. 0.1799
sigma^2 = 9.741e+11: log likelihood = -274.17
AIC=552.35 AICc=553.15 BIC=554.13
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 155812.9 730871.4 446412.1 211.4865 226.9794 0.5925095 0.01513287
The forecasting models for all three time series showed high prediction errors on the training data, which suggests they may not be capturing the underlying patterns effectively. The significant seasonality detected in all series is addressed through seasonal differencing, yet the high error metrics imply that further refinement, possibly through parameter tuning or the inclusion of additional explanatory variables, could improve forecast accuracy. These findings call for caution in relying on the forecasts without further model improvements and validation against unseen data.
The time series analysis conducted on the A_A_M, B_C_W, and C_C_W series offered valuable insights into the underlying patterns and potential for forecasting. However, there are several avenues for improvement that could enhance the reliability and accuracy of future models:
Model Optimization: While the SARIMA models provided a starting point for analysis, further refinement is necessary. The high RMSE and MAPE values indicate that the models may not be capturing the underlying process adequately. More sophisticated model selection techniques or additional explanatory variables could improve performance.
Data Wrangling Efficiency: Given time constraints, utilizing auto.arima can streamline the model selection process after initial data wrangling. This function can automate the identification of optimal model parameters, saving valuable time and computational resources.
Time Series Duration: The current dataset spans four years, which may not capture longer-term cyclical behavior or structural changes in the data. Time series analysis often benefits from longer periods to discern between random fluctuations and true patterns. Future studies should consider extending the timeframe if data availability allows.
Additional Data: Incorporating more granular data or external variables could help to explain some of the variance not accounted for by the time series models alone. Economic indicators, market trends, or categorical events could provide further context for the fluctuations observed in the series.
Model Diagnostics: Post-modeling diagnostics play a critical role in validating the assumptions of the time series models. Checks for autocorrelation, non-normality, and heteroscedasticity in the residuals can signal the need for model adjustments or additional differencing.
Forecasting Evaluation: The accuracy of forecasts should be evaluated against a holdout sample or through time series cross-validation to provide a more robust measure of the model’s predictive capabilities.
Seasonality Adjustments: The significant seasonality indicated by the Mean Absolute Deviation (MAD) in the series underscores the need to refine how seasonal effects are accounted for, possibly through more complex seasonal models or transformation techniques.
Training Data Concerns: The high training errors observed suggest the models may not generalize well to unseen data. Future models should focus on improving the fit on the training data without overfitting, possibly through regularization techniques or model averaging.
Software and Tools: Upgrading to more advanced statistical software or leveraging machine learning tools may offer better functionality for modeling complex time series data and automating parts of the analytical process.
Collaborative Efforts: Time series forecasting can benefit from collaborative efforts, bringing together domain experts and data scientists to ensure that models are not only statistically sound but also grounded in real-world phenomena.
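Two of the suggestions above, automated model selection with `auto.arima` and rolling-origin forecast evaluation with `tsCV`, can be sketched together. This assumes `ts_list` as defined earlier and uses `A_A_M` as an illustrative series:

```r
# Sketch: automated model selection plus time series cross-validation
# (assumption: 'ts_list' is the named list of monthly ts objects used above)
library(forecast)

series <- ts_list[["A_A_M"]]

# Thorough (non-stepwise) automated SARIMA search
fit <- auto.arima(series, stepwise = FALSE, approximation = FALSE)
summary(fit)

# Rolling-origin cross-validation at horizon h = 1:
# refit on each expanding window and collect one-step-ahead errors
f_arima <- function(y, h) forecast(auto.arima(y), h = h)
cv_errors <- tsCV(series, f_arima, h = 1)

# Cross-validated RMSE, a more honest accuracy measure than training error
sqrt(mean(cv_errors^2, na.rm = TRUE))
```

Refitting `auto.arima` inside `tsCV` is slow on long series; a fixed specification can be substituted once the model form has stabilized.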
By addressing these areas, the predictive power and reliability of time series models used in future analyses can be significantly improved, leading to more accurate forecasts and better-informed decision-making.